1. Introduction

Charles Stein's crafty exploitation of the characterization

\[ X \sim \mathcal{N}(0,1) \iff \mathrm{E}\left[f'(X) - X f(X)\right] = 0 \text{ for all bounded } f \in C^1(\mathbb{R}) \tag{1.1} \]

has given birth to a "method" which is now an acclaimed tool both in applied
and in theoretical probability. The secret of the "method" lies in the structure
of the operator $\mathcal{T}_\phi f(x) := f'(x) - x f(x)$ and in the flexibility in the choice
of test functions $f$. For the origins we refer the reader to [35, 36, 38]; for an
overview of the more recent achievements in this field we refer to the monographs
[2, 5, 12, 28] or the review articles [27, 31].
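
As a rough numerical illustration of (1.1), the sketch below approximates $\mathrm{E}[f'(X) - Xf(X)]$ by quadrature for one bounded test function, $f = \tanh$, first under the standard Gaussian and then under a Gaussian with mean 0.5. The helper names (`simpson`, `stein_functional`) and the integration window of twelve standard deviations are assumptions made for this example only.

```python
import math

def simpson(g, lo, hi, n=4000):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * g(lo + i * h)
    return total * h / 3.0

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def stein_functional(f, f_prime, mu, sigma):
    """Approximate E[f'(X) - X f(X)] for X ~ N(mu, sigma^2)."""
    integrand = lambda x: (f_prime(x) - x * f(x)) * normal_pdf(x, mu, sigma)
    return simpson(integrand, mu - 12.0 * sigma, mu + 12.0 * sigma)

f = math.tanh                                  # bounded and continuously differentiable
f_prime = lambda x: 1.0 / math.cosh(x) ** 2

standard = stein_functional(f, f_prime, 0.0, 1.0)   # vanishes up to quadrature error
shifted = stein_functional(f, f_prime, 0.5, 1.0)    # clearly different from zero
print(standard, shifted)
```

A single test function can only refute Gaussianity, never confirm it; the characterization requires the identity for the whole class of test functions.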

Among the many ramifications and extensions that the method has known, 
so far the connection with information theory has gone relatively unexplored. 
Indeed while it has long been known that Stein identities such as (1.1) are related 
to information theoretic tools and concepts (see, e.g., [14, 20, 22]), to the best of 
our knowledge the only reference to explore this connection upfront is [3] in the 
context of compound Poisson approximation. In this paper and the companion 
paper [23] we extend Stein's characterization of the Gaussian (1.1) to a broad 
class of univariate distributions and, in doing so, provide an adequate framework 
in which the connection with information distances becomes transparent. 

The structure of the present paper is as follows. In Section 2 we provide the
new perspective on the density approach from [37] which allows us to extend this
construction to virtually any absolutely continuous probability distribution on
the real line. In Section 3 we exploit the structure of our new operator to derive
a family of Stein identities through which the connection with information
distances becomes evident. In Section 4 we compute bounds on the constants
appearing in our inequalities; our method of proof is, to the best of our knowledge,
original. Finally in Section 5 we discuss specific examples.



2. The density approach 

Let $\mathcal{G}$ be the collection of positive real functions $x \mapsto p(x)$ such that (i) their
support $S_p := \{x \in \mathbb{R} : p(x) \text{ (exists and) is positive}\}$ is an interval with closure
$\bar{S}_p = [a, b]$, for some $-\infty \le a < b \le \infty$, (ii) they are differentiable (in the usual
sense) at every point in $(a, b)$ with derivative $x \mapsto p'(x) := \frac{d}{dy} p(y)\big|_{y=x}$ and
(iii) $\int_{S_p} p(y)\,dy = 1$. Obviously, each $p \in \mathcal{G}$ is the density (with respect to the
Lebesgue measure) of an absolutely continuous random variable. Throughout
we adopt the convention

\[ \frac{1}{p(x)} = \begin{cases} \dfrac{1}{p(x)} & \text{if } x \in S_p, \\ 0 & \text{otherwise;} \end{cases} \]

this implies, in particular, that $p(x)/p(x) = \mathbb{I}_{S_p}(x)$, the indicator function of the
support $S_p$. As final notation, for $p \in \mathcal{G}$ we write $\mathrm{E}_p[l(X)] := \int_{S_p} l(x)p(x)\,dx$.

With this setup in hand we are ready to provide the two main definitions of 
this paper (namely, a class of functions and an operator) and to state and prove 
our first main result (namely, a characterization). 

Definition 2.1. To $p \in \mathcal{G}$ we associate (i) the collection $\mathcal{F}(p)$ of functions
$f : \mathbb{R} \to \mathbb{R}$ such that the mapping $x \mapsto f(x)p(x)$ is differentiable on the interior
of $S_p$ and $f(a^+)p(a^+) = f(b^-)p(b^-) = 0$, and (ii) the operator $\mathcal{T}_p : \mathcal{F}(p) \to \mathbb{R}^\star :
f \mapsto \mathcal{T}_p f$ defined through

\[ \mathcal{T}_p f : \mathbb{R} \to \mathbb{R} : x \mapsto \mathcal{T}_p f(x) := \frac{1}{p(x)} \frac{d}{dy}\big(f(y)p(y)\big)\Big|_{y=x}. \tag{2.1} \]

We call $\mathcal{F}(p)$ the class of test functions associated with $p$, and $\mathcal{T}_p$ the Stein
operator associated with $p$.

Theorem 2.1. Let $p, q \in \mathcal{G}$ and let $Q(b) = \int_a^b q(u)\,du$. Then $\int_{-\infty}^{+\infty} \mathcal{T}_p f(y)q(y)\,dy = 0$
for all $f \in \mathcal{F}(p)$ if, and only if, $q(x) = p(x)Q(b)$ for all $x \in S_p$.

Proof. If $Q(b) = 0$ the statement holds trivially. We now take $Q(b) > 0$. To see
the sufficiency, note that the hypotheses on $f$, $p$ and $q$ guarantee that

\[ \int_{-\infty}^{+\infty} \mathcal{T}_p f(y)q(y)\,dy = Q(b) \int_a^b \frac{d}{du}\big(f(u)p(u)\big)\Big|_{u=y}\,dy = Q(b)\big(f(b^-)p(b^-) - f(a^+)p(a^+)\big) = 0. \]

To see the necessity first note that the condition $\int_{\mathbb{R}} \mathcal{T}_p f(y)q(y)\,dy = 0$ implies
that the function $y \mapsto \mathcal{T}_p f(y)q(y)$ be Lebesgue-integrable. Next define for $z \in \mathbb{R}$
the function

\[ l_z(u) := \big(\mathbb{I}_{(a,z]}(u) - P(z)\big)\mathbb{I}_{S_p}(u) \]

with $P(z) := \int_a^z p(u)\,du$, which satisfies

\[ \int_a^b l_z(u)p(u)\,du = 0. \]

Then the function

\[ f_z^p(x) := \frac{1}{p(x)} \int_a^x l_z(u)p(u)\,du \quad \left[= -\frac{1}{p(x)} \int_x^b l_z(u)p(u)\,du\right] \]

belongs to $\mathcal{F}(p)$ for all $z$ and satisfies the equation

\[ \mathcal{T}_p f_z^p(x) = l_z(x) \]

for all $x \in S_p$. For this choice of test function we then obtain

\[ \int_{-\infty}^{+\infty} \mathcal{T}_p f_z^p(y)q(y)\,dy = \int_a^b l_z(y)q(y)\,dy = \big(Q(z) - P(z)Q(b)\big)\mathbb{I}_{S_p}(z), \]

with $Q(z) := \int_a^z q(u)\,du$. Since this integral equals zero by hypothesis, it follows
that $Q(z) = P(z)Q(b)$ for all $z \in S_p$, hence the claim holds. □
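
The test function used in the necessity part can be made concrete. The sketch below takes $p$ to be the rate-one exponential density (an assumption made for illustration), writes $f_z^p$ in the closed form $\big(P(\min(x,z)) - P(z)P(x)\big)/p(x)$ obtained by evaluating the integral, and checks with a central difference that $\mathcal{T}_p f_z^p = l_z$ away from the kink at $x = z$, together with the two boundary conditions defining $\mathcal{F}(p)$. All function names are hypothetical.

```python
import math

def p(x):
    """Rate-one exponential density, supported on (0, inf)."""
    return math.exp(-x) if x > 0 else 0.0

def P(x):
    """Distribution function of p."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def l_z(u, z):
    """l_z(u) = 1{u <= z} - P(z) on the support, 0 elsewhere."""
    if u <= 0:
        return 0.0
    return (1.0 if u <= z else 0.0) - P(z)

def f_z(x, z):
    """Closed form of (1/p(x)) * int_0^x l_z(u) p(u) du for x > 0."""
    return (P(min(x, z)) - P(z) * P(x)) / p(x)

def stein_operator(f, x, h=1e-6):
    """T_p f(x) = (f p)'(x) / p(x), via a central difference of the product f p."""
    fp = lambda y: f(y) * p(y)
    return (fp(x + h) - fp(x - h)) / (2.0 * h) / p(x)

z = 1.0
points = [0.2, 0.7, 1.5, 3.0]   # away from the kink at x = z
max_error = max(abs(stein_operator(lambda x: f_z(x, z), x) - l_z(x, z)) for x in points)
# f p should vanish at both ends of the support
boundary = (f_z(1e-9, z) * p(1e-9), f_z(40.0, z) * p(40.0))
print(max_error, boundary)
```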

The above is, in a sense, nothing more than a peculiar statement of what is
often referred to as a "Stein characterization". Within the more conventional
framework of real random variables having absolutely continuous densities,
Theorem 2.1 reads as follows.

Corollary 2.1 (The density approach). Let $X$ be an absolutely continuous
random variable with density $p \in \mathcal{G}$. Let $Y$ be another absolutely continuous
random variable. Then $\mathrm{E}[\mathcal{T}_p f(Y)] = 0$ for all $f \in \mathcal{F}(p)$ if, and only if, either
$\mathrm{P}(Y \in S_p) = 0$ or $\mathrm{P}(Y \in S_p) > 0$ and

\[ \mathrm{P}(Y \le z \mid Y \in S_p) = \mathrm{P}(X \le z) \]

for all $z \in S_p$.
Corollary 2.1 extends the density approach from [37] or [11, 12] to a much
wider class of distributions; it also contains the Stein characterizations for the
Pearson family given in [32] and the more recent general characterizations studied in
[15, 18]. There is, however, a significant shift operated between our "derivative
of a product" operator (2.1) and the standard way of writing these operators in
the literature. Indeed, while one can always distribute the derivative in (2.1) to
obtain (at least formally) the expansion

\[ \mathcal{T}_p f(x) = \left(f'(x) + \frac{p'(x)}{p(x)} f(x)\right)\mathbb{I}_{S_p}(x), \tag{2.2} \]

the latter requires $f$ be differentiable on $S_p$ in order to make sense. We do not
require this, neither do we require that each summand in (2.2) be well-defined
on $S_p$ nor do we need to impose integrability conditions on $f$ for Theorem 2.1
(and thus Corollary 2.1) to hold! Rather, our definition of $\mathcal{F}(p)$ allows us to
identify a collection of minimal conditions on the class of test functions $f$ for the
resulting operator $\mathcal{T}_p$ to be orthogonal to $p$ w.r.t. the Lebesgue measure, and
thus characterize $p$.

Example 2.1. Take $p = \phi$, the standard Gaussian. Then $\mathcal{F}(\phi)$ is composed of
all real-valued functions $f$ such that (i) $x \mapsto f(x)e^{-x^2/2}$ is differentiable on $\mathbb{R}$
and (ii) $\lim_{x \to \pm\infty} f(x)e^{-x^2/2} = 0$. In particular $\mathcal{F}(\phi)$ contains the collection of
all differentiable bounded functions and

\[ \mathcal{T}_\phi f(x) = f'(x) - x f(x), \]

which is Stein's well-known operator for characterizing the Gaussian (see, e.g.,
[2, 12, 35]). There are of course many other subclasses that can be of interest.
For example the class $\mathcal{F}(\phi)$ also contains the collection of functions $f(x) =
-f_0'(x)$ with $f_0$ a twice differentiable bounded function; for these we get

\[ \mathcal{T}_\phi f(x) = x f_0'(x) - f_0''(x), \]

the generator of an Ornstein-Uhlenbeck process, see [4, 19, 28]. The class $\mathcal{F}(\phi)$
as well contains the collection of functions of the form $f(x) = H_n(x)f_0(x)$ for
$H_n$ the $n$-th Hermite polynomial and $f_0$ any differentiable and bounded function.
For these $f$ we get

\[ \mathcal{T}_\phi f(x) = H_n(x)f_0'(x) - H_{n+1}(x)f_0(x), \]

an operator already discussed in [17] (equation (38)).

Example 2.2. Take $p = \mathrm{Exp}$ the standard rate-one exponential distribution.
Then $\mathcal{F}(\mathrm{Exp})$ is composed of all real-valued functions $f$ such that (i) $x \mapsto
f(x)e^{-x}$ is differentiable on $(0, +\infty)$, (ii) $f(0) = 0$ and (iii) $\lim_{x \to +\infty} f(x)e^{-x} =
0$. In particular $\mathcal{F}(\mathrm{Exp})$ contains the collection of all differentiable bounded
functions such that $f(0) = 0$ and

\[ \mathcal{T}_{\mathrm{Exp}} f(x) = \big(f'(x) - f(x)\big)\mathbb{I}_{[0,\infty)}(x), \]

the operator usually associated to the exponential, see [25, 29, 37]. The class
$\mathcal{F}(\mathrm{Exp})$ also contains the collection of functions of the form $f(x) = x f_0(x)$ for
$f_0$ any differentiable bounded function. For these $f$ we get

\[ \mathcal{T}_{\mathrm{Exp}} f(x) = \big(x f_0'(x) + (1 - x) f_0(x)\big)\mathbb{I}_{[0,\infty)}(x), \]

an operator put to use in [10].

Example 2.3. Finally take $p = \mathrm{Beta}(\alpha, \beta)$ the beta distribution with
parameters $(\alpha, \beta) \in \mathbb{R}_0^+ \times \mathbb{R}_0^+$. Then $\mathcal{F}(\mathrm{Beta}(\alpha, \beta))$ is composed of all real-valued
functions $f$ such that (i) $x \mapsto f(x)x^{\alpha-1}(1 - x)^{\beta-1}$ is differentiable on $(0,1)$,
(ii) $\lim_{x \to 0} f(x)x^{\alpha-1}(1 - x)^{\beta-1} = 0$ and (iii) $\lim_{x \to 1} f(x)x^{\alpha-1}(1 - x)^{\beta-1} = 0$.
In particular $\mathcal{F}(\mathrm{Beta}(\alpha, \beta))$ contains the collection of functions of the form
$f(x) = (x(1 - x))f_0(x)$ with $f_0$ any differentiable bounded function. For these $f$
we get

\[ \mathcal{T}_{\mathrm{Beta}(\alpha,\beta)} f(x) = \big((\alpha(1 - x) - \beta x) f_0(x) + x(1 - x) f_0'(x)\big)\mathbb{I}_{[0,1]}(x), \]

an operator recently put to use in, e.g., [15, 18].

There are obviously many more distributions that can be tackled as in the 
previous examples (including the Pearson case from [32]), which we leave to the 
interested reader. 
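
The closed forms of Examples 2.1 to 2.3 can be cross-checked against the "derivative of a product" definition (2.1). The sketch below is only a numerical sanity check under assumptions chosen for illustration: the function $f_0 = \cos$, the Beta parameters $(2, 3.5)$, the evaluation points and the helper name `stein_operator` are all hypothetical. Normalizing constants are omitted because they cancel in (2.1).

```python
import math

def stein_operator(f, p, x, h=1e-5):
    """T_p f(x) = (f p)'(x) / p(x), using a central difference of the product f p."""
    fp = lambda y: f(y) * p(y)
    return (fp(x + h) - fp(x - h)) / (2.0 * h) / p(x)

f0 = math.cos                      # a smooth bounded function
df0 = lambda x: -math.sin(x)       # its derivative

# Example 2.1: standard Gaussian, f = f0.
gauss = lambda x: math.exp(-0.5 * x * x)          # unnormalised density
gauss_closed = lambda x: df0(x) - x * f0(x)

# Example 2.2: rate-one exponential, f(x) = x f0(x).
expo = lambda x: math.exp(-x)                     # evaluated on (0, inf) only
expo_f = lambda x: x * f0(x)
expo_closed = lambda x: x * df0(x) + (1.0 - x) * f0(x)

# Example 2.3: Beta(a, b), f(x) = x (1 - x) f0(x).
a, b = 2.0, 3.5
beta = lambda x: x ** (a - 1.0) * (1.0 - x) ** (b - 1.0)   # unnormalised density
beta_f = lambda x: x * (1.0 - x) * f0(x)
beta_closed = lambda x: (a * (1.0 - x) - b * x) * f0(x) + x * (1.0 - x) * df0(x)

cases = [
    (f0, gauss, gauss_closed, [-2.0, 0.3, 1.7]),
    (expo_f, expo, expo_closed, [0.5, 1.0, 3.0]),
    (beta_f, beta, beta_closed, [0.2, 0.5, 0.8]),
]
max_error = max(
    abs(stein_operator(f, p, x) - closed(x))
    for f, p, closed, xs in cases
    for x in xs
)
print(max_error)
```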



3. Stein-type identities and the generalized Fisher information distance



It has long been known that, in certain favorable circumstances, the properties
of the Fisher information or of the Shannon entropy can be used quite effectively
to prove information theoretic central limit theorems; the early references in this
vein are [6, 7, 24, 33]. Convergence in information CLTs is generally studied in
terms of information (pseudo-)distances such as the Kullback-Leibler divergence
between two densities $p$ and $q$, defined as

\[ d_{KL}(p\|q) = \mathrm{E}_q\left[\log\left(\frac{q(X)}{p(X)}\right)\right] \tag{3.1} \]

or the Fisher information distance

\[ \mathcal{J}(\phi, q) = \mathrm{E}_q\left[\left(X + \frac{q'(X)}{q(X)}\right)^2\right] \tag{3.2} \]

which measures deviation between any density $q$ and the standard Gaussian
$\phi$. Though they allow for extremely elegant proofs, convergence in the sense of
(3.1) or (3.2) results in very strong statements. Indeed both (3.1) and (3.2) are
known to dominate more "traditional" probability metrics. More precisely we
have, on the one hand, Pinsker's inequality

\[ d_{TV}(p, q) \le \frac{1}{\sqrt{2}} \sqrt{d_{KL}(p\|q)}, \tag{3.3} \]

for $d_{TV}(p, q)$ the total variation distance between the laws $p$ and $q$ (see, e.g.,
[16, p. 429]), and, on the other hand,

\[ d_{L^1}(\phi, q) \le \sqrt{2} \sqrt{\mathcal{J}(\phi, q)} \tag{3.4} \]

for $d_{L^1}(\phi, q)$ the $L^1$ distance between the laws $\phi$ and $q$ (see [21, Lemma 1.6]).
These information inequalities show that convergence in the sense of (3.1) or
(3.2) implies convergence in total variation or in $L^1$, for example. Note that
one can further use De Bruijn's identity on (3.3) to deduce that convergence in
Fisher information is itself stronger than convergence in relative entropy.

While Pinsker's inequality (3.3) is valid irrespective of the choice of $p$ and $q$
(and enjoys an extension to discrete random variables), both (3.2) and (3.4) are
reserved for Gaussian convergence. Now there exist extensions of the distance
(3.2) to non-Gaussian distributions (see [3] for the discrete case) which, as could
be expected, have also been shown to dominate the more traditional probability
metrics. There is, however, no general counterpart of Pinsker's inequality for
the Fisher information distance (3.2); at least there exists, to the best of our
knowledge, no inequality in the literature which extends (3.4) to a general couple
of densities $p$ and $q$.

In this section we use the density approach outlined in Section 2 to construct
Stein-type identities which provide the required extension of (3.4). More
precisely, we will show that a wide family of probability metrics (including the
Kolmogorov, the Wasserstein and the $L^1$ distances) is dominated by the quantity

\[ \mathcal{J}(p, q) := \mathrm{E}_q\left[\left(\frac{p'(X)}{p(X)} - \frac{q'(X)}{q(X)}\right)^2\right]. \tag{3.5} \]

Our bounds, moreover, contain an explicit constant which will be shown in
Section 4 to be no worse than the best bounds in all known instances. In
the spirit of [3] we call (3.5) the generalized Fisher information distance between
the densities $p$ and $q$, although here we slightly abuse language since (3.5)
defines a pseudo-distance rather than a bona fide metric between probability
density functions.

We start with an elementary statement which relates, for $p \neq q$, the Stein
operators $\mathcal{T}_p$ and $\mathcal{T}_q$ through the difference of their respective score functions
$p'/p$ and $q'/q$.

Lemma 3.1. Let $p$ and $q$ be probability density functions in $\mathcal{G}$ with respective
supports $S_p$ and $S_q$. Let $S_q \subseteq S_p$ and define

\[ r(p, q)(x) := \left(\frac{p'(x)}{p(x)} - \frac{q'(x)}{q(x)}\right)\mathbb{I}_{S_p}(x). \]

Suppose that $\mathcal{F}(p) \cap \mathcal{F}(q) \neq \emptyset$. Then, for all $f \in \mathcal{F}(p) \cap \mathcal{F}(q)$, we have

\[ \mathcal{T}_p f(x) = \mathcal{T}_q f(x) + f(x)\,r(p, q)(x) + \mathcal{T}_p f(x)\,\mathbb{I}_{S_p \setminus S_q}(x) \]

and therefore

\[ \mathrm{E}_q[\mathcal{T}_p f(X)] = \mathrm{E}_q[f(X)\,r(p, q)(X)]. \tag{3.6} \]
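Proof. Splitting $S_p$ into $S_q \cup \{S_p \setminus S_q\}$, we have

\[ f(y)p(y) = f(y)q(y)\frac{p(y)}{q(y)}\mathbb{I}_{S_q}(y) + f(y)p(y)\mathbb{I}_{S_p \setminus S_q}(y) \]

for any real-valued function $f$. At any $x$ in the interior of $S_p$ we thus can write

\begin{align*}
\mathcal{T}_p f(x) &= \frac{\frac{d}{dy}\big(f(y)q(y)\,p(y)/q(y)\big)\big|_{y=x}}{p(x)}\,\mathbb{I}_{S_q}(x) + \mathcal{T}_p f(x)\,\mathbb{I}_{S_p \setminus S_q}(x) \\
&= \left(\frac{\frac{d}{dy}\big(f(y)q(y)\big)\big|_{y=x}}{p(x)}\,\frac{p(x)}{q(x)} + f(x)q(x)\,\frac{\frac{d}{dy}\big(p(y)/q(y)\big)\big|_{y=x}}{p(x)}\right)\mathbb{I}_{S_q}(x) + \mathcal{T}_p f(x)\,\mathbb{I}_{S_p \setminus S_q}(x) \\
&= \left(\mathcal{T}_q f(x) + f(x)\,\frac{q(x)}{p(x)}\,\frac{d}{dy}\big(p(y)/q(y)\big)\Big|_{y=x}\right)\mathbb{I}_{S_q}(x) + \mathcal{T}_p f(x)\,\mathbb{I}_{S_p \setminus S_q}(x).
\end{align*}

The first claim readily follows by simplification, the second by taking expectations
under $q$ which cancels the first term $\mathcal{T}_q f(x)$ (by definition) as well as the
third term $\mathcal{T}_p f(x)\,\mathbb{I}_{S_p \setminus S_q}(x)$ (since the supports do not coincide). □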

Remark 3.1. Our proof of Lemma 3.1 may seem convoluted; indeed a much
easier proof is obtainable by writing $\mathcal{T}_p$ under the form (2.2). We nevertheless
stick to the "derivative of a product" structure of our operator because this
spares us superfluous - and, in some cases, unwanted - differentiability
conditions on the test functions.

From identity (3.6) we deduce the following immediate result, which requires 
no proof. 

Lemma 3.2. Let $p$ and $q$ be probability density functions in $\mathcal{G}$ with respective
supports $S_q \subseteq S_p$. Let $l$ be a real-valued function such that $\mathrm{E}_p[l(X)]$ and $\mathrm{E}_q[l(X)]$
exist; also suppose that there exists $f \in \mathcal{F}(p) \cap \mathcal{F}(q)$ such that

\[ \mathcal{T}_p f(x) = \big(l(x) - \mathrm{E}_p[l(X)]\big)\mathbb{I}_{S_p}(x); \tag{3.7} \]

we denote this function $f_l^p$. Then

\[ \mathrm{E}_q[l(X)] - \mathrm{E}_p[l(X)] = \mathrm{E}_q\big[f_l^p(X)\,r(p, q)(X)\big]. \tag{3.8} \]
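
The identity (3.8) can be checked numerically in a simple case. The sketch below assumes $p = \phi$, $q$ the Gaussian density with mean $\mu$ and unit variance, and $l$ the indicator of the lower half-line $(-\infty, z]$; for this pair the score difference is the constant $r(\phi, q) = -\mu$, and the solution of (3.7) has the closed form $\Phi(x)(1 - \Phi(z))/\phi(x)$ for $x \le z$ and $\Phi(z)(1 - \Phi(x))/\phi(x)$ for $x > z$. The values of $\mu$ and $z$, the integration window and the helper names are assumptions made for this example.

```python
import math

SQRT2 = math.sqrt(2.0)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / SQRT2)

def Phi_bar(x):
    """Upper tail 1 - Phi(x), computed without cancellation."""
    return 0.5 * math.erfc(x / SQRT2)

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def simpson(g, lo, hi, n=4000):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * g(lo + i * h)
    return total * h / 3.0

def stein_solution(x, z):
    """Solution of (3.7) for p = phi and l the indicator of (-inf, z]."""
    if x <= z:
        return Phi(x) * Phi_bar(z) / phi(x)
    return Phi(z) * Phi_bar(x) / phi(x)

mu, z = 0.7, 0.3
q = lambda x: phi(x - mu)
r = -mu    # phi'/phi - q'/q = -x + (x - mu)

lhs = Phi(z - mu) - Phi(z)                      # E_q[l(X)] - E_p[l(X)]
integrand = lambda x: stein_solution(x, z) * r * q(x)
# split at the kink x = z so that each piece is smooth
rhs = simpson(integrand, mu - 12.0, z) + simpson(integrand, z, mu + 12.0)
print(lhs, rhs)
```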

The identity (3.8) belongs to the family of so-called "Stein-type identities"
discussed for instance in [1, 8, 17]. In order to be of use, such identities need to
be valid over a large class of test functions $l$. Now it is immediate to write out
the solution $f_l^p$ of the so-called "Stein equation" (3.7) explicitly for any given
$p$ and $l$; it is therefore relatively simple to identify under which conditions on $l$
and $q$ the requirement $f_l^p \in \mathcal{F}(q)$ is verified (since $f_l^p \in \mathcal{F}(p)$ is anyway true).

Remark 3.2. For instance, for $p = \phi$ the standard Gaussian, one easily sees
that $\lim_{x \to \pm\infty} f_l^\phi(x) = 0$, hence, when $S_q = S_\phi = \mathbb{R}$, $q$ only has to be
(differentiable and) bounded for $f_l^\phi$ to belong to $\mathcal{F}(q)$. However, when $S_q \subset \mathbb{R}$, then $q$
has to satisfy, moreover, the stronger condition of vanishing at the endpoints of
its support $S_q$ since $f_l^\phi$ need not equal zero at any finite point of $\mathbb{R}$.

We shall see in the next section that the required conditions for $f_l^p \in \mathcal{F}(q)$ are
satisfied in many important cases by wide classes of functions $l$. The resulting
flexibility makes (3.8) a surprisingly powerful identity, as can be seen from our
next result.

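Theorem 3.1. Let $p$ and $q$ be probability density functions in $\mathcal{G}$ with respective
supports $S_q \subseteq S_p$ and such that $\mathcal{F}(p) \cap \mathcal{F}(q) \neq \emptyset$. Let

\[ d_{\mathcal{H}}(p, q) = \sup_{l \in \mathcal{H}} \big|\mathrm{E}_q[l(X)] - \mathrm{E}_p[l(X)]\big| \tag{3.9} \]

for some class of functions $\mathcal{H}$. Suppose that for all $l \in \mathcal{H}$ the function $f_l^p$, as
defined in (3.7), exists and satisfies $f_l^p \in \mathcal{F}(p) \cap \mathcal{F}(q)$. Then

\[ d_{\mathcal{H}}(p, q) \le \kappa_{\mathcal{H}}^p \sqrt{\mathcal{J}(p, q)}, \tag{3.10} \]

where

\[ \kappa_{\mathcal{H}}^p = \sup_{l \in \mathcal{H}} \sqrt{\mathrm{E}_q\big[(f_l^p(X))^2\big]} \tag{3.11} \]

and

\[ \mathcal{J}(p, q) = \mathrm{E}_q\big[(r(p, q)(X))^2\big], \tag{3.12} \]

the generalized Fisher information distance between the densities $p$ and $q$.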

This theorem implies that all probability metrics that can be written in the
form (3.9) are bounded by the generalized Fisher information distance $\mathcal{J}(p, q)$
(which, of course, can be infinite for certain choices of $p$ and $q$). Equation (3.10)
thus represents the announced extension of (3.4) to any couple of densities $(p, q)$
and hence constitutes, in a sense, a counterpart to Pinsker's inequality (3.3) for
the Fisher information distance. We will see in Section 5 how this inequality
reads for specific choices of $\mathcal{H}$, $p$ and $q$.
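
To see how (3.10) combines its two ingredients, the sketch below evaluates both sides for one pair: $p = \phi$, $q$ Gaussian with mean $\mu = 0.8$ and unit variance, and $\mathcal{H}$ the indicators of lower half-lines. In this case $\mathcal{J}(\phi, q) = \mu^2$ and the left-hand side is $2\Phi(\mu/2) - 1$. The supremum in (3.11) is only approximated over a finite grid of thresholds $z$ (an assumption of this example, as are all helper names), so the printed constant is a lower approximation of $\kappa_{\mathcal{H}}^p$.

```python
import math

SQRT2 = math.sqrt(2.0)

def Phi(x):
    return 0.5 * math.erfc(-x / SQRT2)

def Phi_bar(x):
    return 0.5 * math.erfc(x / SQRT2)

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def simpson(g, lo, hi, n):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * g(lo + i * h)
    return total * h / 3.0

def stein_solution(x, z):
    """Solution of (3.7) for p = phi and l the indicator of (-inf, z]."""
    if x <= z:
        return Phi(x) * Phi_bar(z) / phi(x)
    return Phi(z) * Phi_bar(x) / phi(x)

mu = 0.8
q = lambda x: phi(x - mu)
J = mu ** 2                     # E_q[r^2] with r(phi, q) = -mu

def second_moment(z, n=800):
    """E_q[(f_l^p(X))^2] for the threshold z, split at the kink x = z."""
    g = lambda x: stein_solution(x, z) ** 2 * q(x)
    return simpson(g, mu - 12.0, z, n) + simpson(g, z, mu + 12.0, n)

zs = [mu / 2.0 + 0.25 * k for k in range(-16, 17)]   # grid containing z = mu / 2
kappa = max(math.sqrt(second_moment(z)) for z in zs)
d_kol = 2.0 * Phi(mu / 2.0) - 1.0                     # sup_z |Phi(z) - Phi(z - mu)|
bound = kappa * math.sqrt(J)
print(d_kol, bound)
```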

4. Bounding the constants 

The constants in (3.11) depend on both densities $p$ and $q$ and therefore, to
be fair, should be denoted $\kappa_{\mathcal{H}}^{p,q}$. Our notation is nevertheless justified because
we always have

\[ \kappa_{\mathcal{H}}^p \le \sup_{l \in \mathcal{H}} \|f_l^p\|_\infty, \tag{4.1} \]

where the latter bounds (sometimes referred to as Stein factors or magic factors)
do not depend on $q$ and have been computed for many choices of $\mathcal{H}$ and $p$.
Consequently, $\kappa_{\mathcal{H}}^p$ is finite in many known cases - including, of course, that of
a Gaussian target.

Example 4.1. Take $p = \phi$, the standard Gaussian. Then, from (4.1), we
get the bounds (i) $\kappa_{\mathcal{H}}^\phi \le \sqrt{\pi/2}$ for $\mathcal{H}$ the collection of Borel functions in
$[0,1]$ (see [28, Theorem 3.3.1]); (ii) $\kappa_{\mathcal{H}}^\phi \le \sqrt{2\pi}/4$ for $\mathcal{H}$ the class of indicator
functions for lower half-lines (see [28, Theorem 3.4.2]); and (iii) $\kappa_{\mathcal{H}}^\phi \le
\sup_{l \in \mathcal{H}} \min\big(\sqrt{\pi/2}\,\|l - \mathrm{E}_\phi[l(X)]\|_\infty, 2\|l'\|_\infty\big)$ for $\mathcal{H}$ the class of absolutely
continuous functions on $\mathbb{R}$ (see [13, Lemma 2.3]). See also [2, 12, 28, 30] for more
examples.
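
The Stein factor in case (ii) can be recovered by brute force. The sketch below evaluates the closed-form solution of the Stein equation for half-line indicators on a grid of thresholds $z$ and arguments $x$, and compares the largest value found with $\sqrt{2\pi}/4$. The grid $[-5, 5]$ with step $0.05$ and the helper names are assumptions made for illustration.

```python
import math

SQRT2 = math.sqrt(2.0)

def Phi(x):
    return 0.5 * math.erfc(-x / SQRT2)

def Phi_bar(x):
    return 0.5 * math.erfc(x / SQRT2)

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def stein_solution(x, z):
    """Solution of (3.7) for p = phi and l the indicator of (-inf, z]; it is nonnegative."""
    if x <= z:
        return Phi(x) * Phi_bar(z) / phi(x)
    return Phi(z) * Phi_bar(x) / phi(x)

grid = [i / 20.0 for i in range(-100, 101)]          # [-5, 5] in steps of 0.05
sup_norm = max(stein_solution(x, z) for z in grid for x in grid)
stein_factor = math.sqrt(2.0 * math.pi) / 4.0
print(sup_norm, stein_factor)
```

On this grid the maximum is attained at $x = z = 0$.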

Bounds such as (4.1) are sometimes too rough to be satisfactory. We now
provide an alternative bound for $\kappa_{\mathcal{H}}^p$ which, remarkably, improves upon the best
known bounds even in well-trodden cases such as the Gaussian. We focus on
target densities of the form

\[ p(x) = c\,e^{-d|x|^\alpha}\,\mathbb{I}_S(x), \quad \alpha \ge 1, \tag{4.2} \]

with $S$ a scale-invariant subset of $\mathbb{R}$ (that is, either $\mathbb{R}$ or the open/closed
positive/negative real half lines), $d > 0$ some constant and $c$ the appropriate
normalizing constant. The exponential, the Gaussian or the limit distribution for
the Ising model on the complete graph from [11] are all of the form (4.2). Of
course, for $S = \mathbb{R}$, (4.2) represents power exponential densities.

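Theorem 4.1. Take $p \in \mathcal{G}$ as in (4.2) and $q \in \mathcal{G}$ such that $S_q = S$. Consider
$h : \mathbb{R} \to \mathbb{R}$ some Borel function with $p$-mean $\mathrm{E}_p[h(X)] = 0$. Let $f_h^p$ be the unique
bounded solution of the Stein equation

\[ \mathcal{T}_p f(x) = h(x). \tag{4.3} \]

Then

\[ \sqrt{\mathrm{E}_q\big[(f_h^p(X))^2\big]} \le \frac{\|h\|_\infty}{2^{1/\alpha}}. \tag{4.4} \]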

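Proof. Under the assumption that $\mathrm{E}_p[h(X)] = 0$, the unique bounded solution
of (4.3) is given by

\[ f_h^p(x) = \begin{cases} \dfrac{1}{p(x)} \displaystyle\int_{-\infty}^x h(y)p(y)\,dy & \text{if } x \le 0, \\[2ex] -\dfrac{1}{p(x)} \displaystyle\int_x^{\infty} h(y)p(y)\,dy & \text{if } x > 0, \end{cases} \]

the function being, of course, put to 0 if $x$ is outside the support of $p$. Then

\[ \mathrm{E}_q\big[(f_h^p(X))^2\big] = \int_{-\infty}^0 q(x)\left(\frac{1}{p(x)}\int_{-\infty}^x h(y)p(y)\,dy\right)^2 dx + \int_0^{\infty} q(x)\left(\frac{1}{p(x)}\int_x^{\infty} h(y)p(y)\,dy\right)^2 dx =: I^- + I^+, \]

where $I^- = 0$ (resp., $I^+ = 0$) if $S = \mathbb{R}^+$ (resp., $S = \mathbb{R}^-$).

We first tackle $I^-$. Setting $p(x) = c\,e^{-d|x|^\alpha}\mathbb{I}_S(x)$ and using Jensen's inequality,
we get

\begin{align*}
I^- &= \int_{-\infty}^0 q(x)\left(e^{d|x|^\alpha}\int_{-\infty}^x h(u)e^{-d|u|^\alpha}\,du\right)^2 dx \\
&\le \int_{-\infty}^0 q(x)\left(e^{d|x|^\alpha}\int_{-\infty}^x |h(u)|\,e^{-d|u|^\alpha}\,du\right)^2 dx \\
&\le \int_{-\infty}^0 q(x)\left(e^{2d|x|^\alpha}\int_{-\infty}^x h^2(u)e^{-2d|u|^\alpha}\,du\right) dx \\
&= \frac{1}{2^{1/\alpha}}\int_{-\infty}^0 q(x)\,e^{2d|x|^\alpha}\int_{-\infty}^{2^{1/\alpha}x} h^2(u/2^{1/\alpha})\,e^{-d|u|^\alpha}\,du\,dx,
\end{align*}

where the last equality follows from a simple change of variables. Applying
Hölder's inequality we obtain

\[ I^- \le \frac{(I_q^-)^{1/2}}{2^{1/\alpha}}\left(\int_{-\infty}^0 q(x)\left(e^{2d|x|^\alpha}\int_{-\infty}^{2^{1/\alpha}x} h^2(u/2^{1/\alpha})\,e^{-d|u|^\alpha}\,du\right)^2 dx\right)^{1/2} =: I_1^-, \]

where $I_q^- = \mathrm{P}_q(X < 0) := \int_{-\infty}^0 q(x)\,dx$. Repeating the Jensen's inequality-change
of variables-Hölder's inequality scheme once more yields

\[ I^- \le I_1^- \le I_2^- \]

with

\[ I_2^- = \frac{(I_q^-)^{\frac{1}{2}+\frac{1}{4}}}{2^{\frac{1}{\alpha}\left(1+\frac{1}{2}\right)}}\left(\int_{-\infty}^0 q(x)\left(e^{4d|x|^\alpha}\int_{-\infty}^{(2^{1/\alpha})^2 x} h^4\big(u/(2^{1/\alpha})^2\big)\,e^{-d|u|^\alpha}\,du\right)^2 dx\right)^{1/4}. \]

Iterating this procedure $m \in \mathbb{N}$ times we deduce

\[ I^- \le I_1^- \le \cdots \le I_m^- \]

with $I_m^-$ given by

\[ I_m^- = \frac{(I_q^-)^{1 - 2^{-m}}}{2^{N(m)/\alpha}}\left(\int_{-\infty}^0 q(x)\left(e^{2^m d|x|^\alpha}\int_{-\infty}^{(2^{1/\alpha})^m x} h^{2^m}\big(u/(2^{1/\alpha})^m\big)\,e^{-d|u|^\alpha}\,du\right)^2 dx\right)^{1/2^m}, \]

where $N(m) = 1 + \frac{1}{2} + \frac{1}{4} + \cdots + \frac{1}{2^{m-1}}$. Bounding $h^{2^m}\big(u/(2^{1/\alpha})^m\big)$ by $(\|h\|_\infty)^{2^m}$
simplifies the above into

\[ I_m^- \le \frac{\|h\|_\infty^2\,(I_q^-)^{1 - 2^{-m}}}{2^{N(m)/\alpha}}\left(\int_{-\infty}^0 q(x)\left(e^{2^m d|x|^\alpha}\int_{-\infty}^{(2^{1/\alpha})^m x} e^{-d|u|^\alpha}\,du\right)^2 dx\right)^{1/2^m}. \]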

C. Ley and Y. Swan/Stein's density approach and information inequalities 11 

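Since the mapping $y \mapsto \eta(y) := e^{d|y|^\alpha}\int_{-\infty}^y e^{-d|u|^\alpha}\,du$ attains its maximal value
at 0 for $\alpha \ge 1$ (indeed,

\begin{align*}
\eta'(y) &= 1 - e^{d|y|^\alpha}\,d\alpha|y|^{\alpha-1}\int_{-\infty}^y e^{-d|u|^\alpha}\,du \\
&\ge 1 - e^{d|y|^\alpha}\int_{-\infty}^y d\alpha|u|^{\alpha-1}e^{-d|u|^\alpha}\,du = 0,
\end{align*}

hence $\eta$ is monotone increasing), the interior of the parenthesis becomes

\[ \int_{-\infty}^0 q(x)\left(e^{2^m d|x|^\alpha}\int_{-\infty}^{(2^{1/\alpha})^m x} e^{-d|u|^\alpha}\,du\right)^2 dx \le \int_{-\infty}^0 q(x)\frac{1}{c^2}\,dx = \frac{I_q^-}{c^2}. \]

Note that here we have used, for any support $S$, $\int_{-\infty}^0 c\,e^{-d|u|^\alpha}\,du \le 1$. Elevated
to the power $1/2^m$, this factor tends to 1 as $m \to \infty$. Since we also have
$\lim_{m \to \infty} N(m) = 2$ we finally obtain

\[ I^- \le \lim_{m \to \infty} I_m^- \le \frac{\|h\|_\infty^2}{2^{2/\alpha}}\,\mathrm{P}_q(X < 0). \]

Similar manipulations allow us to bound $I^+$ by $\frac{\|h\|_\infty^2}{2^{2/\alpha}}\,\mathrm{P}_q(X > 0)$. Combining
both bounds then allows us to conclude that

\[ \mathrm{E}_q\big[(f_h^p(X))^2\big] \le \frac{\|h\|_\infty^2}{2^{2/\alpha}}, \]

hence the claim holds. □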

This result of course holds true without worrying about $f_h^p \in \mathcal{F}(q)$. However,
in order to make use of these bounds in the present context, the latter condition
has to be taken care of. For densities of the form (4.2), one easily sees that
$f_h^p \in \mathcal{F}(q)$ for all (differentiable and) bounded densities $q$ for $\alpha > 1$, with the
additional assumption, for $\alpha = 1$, that $\lim_{x \to \pm\infty} q(x) = 0$.

Example 4.2. Take $p = \phi$, the standard Gaussian. Then, from (4.4),

\[ \kappa_{\mathcal{H}}^\phi \le \frac{1}{\sqrt{2}} \sup_{l \in \mathcal{H}} \|l - \mathrm{E}_\phi[l(X)]\|_\infty. \tag{4.5} \]

Comparing with the bounds from Example 4.1 we see that (4.5) significantly
improves on the constants in cases (i) and (iii); it is slightly worse in case (ii).

5. Applications 

A wide variety of probability distances can be written under the form (3.9). For
instance the total variation distance is given by

\[ d_{TV}(p, q) = \sup_{A \subset \mathbb{R}} \left|\int_A (p(x) - q(x))\,dx\right| = \frac{1}{2} \sup_{h \in \mathcal{H}_{B[-1,1]}} \big|\mathrm{E}_p[h(X)] - \mathrm{E}_q[h(X)]\big| \]

with $\mathcal{H}_{B[-1,1]}$ the class of Borel functions in $[-1, 1]$, the Wasserstein distance is
given by

\[ d_W(p, q) = \sup_{h \in \mathcal{H}_{Lip1}} \big|\mathrm{E}_p[h(X)] - \mathrm{E}_q[h(X)]\big| \]

with $\mathcal{H}_{Lip1}$ the class of Lipschitz-1 functions on $\mathbb{R}$ and the Kolmogorov distance
is given by

\[ d_{Kol}(p, q) = \sup_{z \in \mathbb{R}} \left|\int_{-\infty}^z (p(x) - q(x))\,dx\right| = \sup_{h \in \mathcal{H}_{HL}} \big|\mathrm{E}_p[h(X)] - \mathrm{E}_q[h(X)]\big| \]

with $\mathcal{H}_{HL}$ the class of indicators of lower half lines. We refer to [16] for more
examples and for an interesting overview of the relationships between these
probability metrics.

Specifying the class $\mathcal{H}$ in Theorem 3.1 allows us to bound all such probability
metrics in terms of the generalized Fisher information distance (3.12). It remains
to compute the constant (3.11), which can be done for all $p$ of the form (4.2)
through (4.4). The following result illustrates these computations in several
important cases.

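Corollary 5.1. Take $p \in \mathcal{G}$ as in (4.2) and $q \in \mathcal{G}$ such that $S_q = S$. For
$\alpha > 1$, suppose that $q$ is (differentiable and) bounded over $S$; for $\alpha = 1$, assume
moreover that $q$ vanishes at the infinite endpoint(s) of $S$. Then we have the
following inequalities:

1. $d_{TV}(p, q) \le 2^{-1/\alpha} \sqrt{\mathcal{J}(p, q)}$
2. $d_{Kol}(p, q) \le 2^{-1/\alpha} \sqrt{\mathcal{J}(p, q)}$
3. $d_W(p, q) \le \dfrac{\sup_{l \in \mathcal{H}_{Lip1}} \|l - \mathrm{E}_p[l(X)]\|_\infty}{2^{1/\alpha}} \sqrt{\mathcal{J}(p, q)}$
4. $d_{L^1}(p, q) = \int_S |p(x) - q(x)|\,dx \le 2^{1-1/\alpha} \sqrt{\mathcal{J}(p, q)}$.

If, for all $y \in S$, $q$ is such that the function $f_{l_y}^p(x) = e^{d|x|^\alpha}\big(\mathbb{I}_{[y,b)}(x) - P(x)\big)$,
where $P$ denotes the cumulative distribution function associated with $p$, belongs
to $\mathcal{F}(q)$, then

\[ d_{\sup}(p, q) = \sup_{x \in S} |p(x) - q(x)| \le \sqrt{\mathcal{J}(p, q)}. \]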
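
For orientation, the sketch below evaluates the distances in items 1, 2 and 4 for a single pair only: $p = \phi$ (so $\alpha = 2$, $d = 1/2$) and $q$ Gaussian with mean $\mu = 0.6$ and unit variance, for which $\sqrt{\mathcal{J}(\phi, q)} = |\mu|$. It prints the numbers next to the right-hand sides stated above; it is an illustration for this one pair under assumed parameter values and helper names, not a verification of the corollary.

```python
import math

SQRT2 = math.sqrt(2.0)

def Phi(x):
    return 0.5 * math.erfc(-x / SQRT2)

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def simpson(g, lo, hi, n=4000):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * g(lo + i * h)
    return total * h / 3.0

mu = 0.6
q = lambda x: phi(x - mu)
gap = lambda x: abs(phi(x) - q(x))
mid = mu / 2.0                       # the two densities cross at x = mu / 2

d_l1 = simpson(gap, -12.0, mid) + simpson(gap, mid, mu + 12.0)
d_tv = 0.5 * d_l1                    # sup over Borel sets equals half the L1 distance
d_kol = 2.0 * Phi(mid) - 1.0         # sup_z |Phi(z) - Phi(z - mu)|
root_J = abs(mu)                     # sqrt(J(phi, q)), since r(phi, q) = -mu
alpha = 2.0

print("TV ", d_tv, "<=", 2.0 ** (-1.0 / alpha) * root_J)
print("Kol", d_kol, "<=", 2.0 ** (-1.0 / alpha) * root_J)
print("L1 ", d_l1, "<=", 2.0 ** (1.0 - 1.0 / alpha) * root_J)
```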

Proof. The first three points follow immediately from the definition of the
distances and Theorems 3.1 and 4.1. To show the fourth, note that

\[ \int_S |p(x) - q(x)|\,dx = \mathrm{E}_p[l(X)] - \mathrm{E}_q[l(X)] \]

for $l(u) = \mathbb{I}_{[p(u) \ge q(u)]} - \mathbb{I}_{[q(u) > p(u)]} = 2\,\mathbb{I}_{[p(u) \ge q(u)]} - 1$. For the last case note that

\[ d_{\sup}(p, q) := \sup_{y \in S} |p(y) - q(y)| = \sup_{y \in S} \big|\mathrm{E}_p[l_y(X)] - \mathrm{E}_q[l_y(X)]\big| \]

for $l_y(x) = \delta_{\{x=y\}}$ the Dirac delta function in $y \in S$. The computation of the
constant $\kappa_{\mathcal{H}}^p$ in this case requires a different approach from our Theorem 4.1.
We defer this to the Appendix. □

We conclude this section, and the paper, with explicit computations in the
Gaussian case $p = \phi$, hence for the classical Fisher information distance. From
here on we adopt the more standard notations and write $\mathcal{J}(X)$ instead of
$\mathcal{J}(\phi, q)$, for $X$ a random variable with density $q$ (which has support $\mathbb{R}$).
Immediate applications of the above yield

\[ \int_{\mathbb{R}} |\phi(x) - q(x)|\,dx \le \sqrt{2}\sqrt{\mathcal{J}(X)}, \]

which is the second inequality in [21, Lemma 1.6] (obtained by entirely different
means). Similarly we readily deduce

\[ \sup_{x \in \mathbb{R}} |\phi(x) - q(x)| \le \sqrt{\mathcal{J}(X)}; \]

this is a significant improvement on the constant in [21, 33].

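Next further suppose that $X$ has density $q$ with mean $\mu$ and variance $\sigma^2$.
Take $Z \sim p$ with $p = \phi_{\mu_0, \sigma_0^2}$ the Gaussian with mean $\mu_0$ and variance $\sigma_0^2$. Then

\[ \mathcal{J}(X) = \mathrm{E}_q\left[\left(\frac{q'(X)}{q(X)} + \frac{X - \mu_0}{\sigma_0^2}\right)^2\right] = I(X) + \frac{(\mu - \mu_0)^2}{\sigma_0^4} + \frac{1}{\sigma_0^2}\left(\frac{\sigma^2}{\sigma_0^2} - 2\right), \]

where $I(X) = \mathrm{E}_q\big[(q'(X)/q(X))^2\big]$ is the Fisher information of the random
variable $X$. General bounds are thus also obtainable from (3.10) in terms of

\[ \Psi := \frac{(\mu - \mu_0)^2}{\sigma_0^4} + \frac{1}{\sigma_0^2}\left(\frac{\sigma^2}{\sigma_0^2} - 1\right) \]

and the quantity

\[ \Gamma(X) = I(X) - \frac{1}{\sigma_0^2}, \]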

referred to as the Cramer-Rao functional for q in [26]. In particular, we deduce 
from Theorem 4.1 and the definition of the total variation distance that 



This is an improvement (in the constant) on [26, Lemma 3.1], and is also related 
to [9, Corollary 1.1]. Similarly, taking H the collection of indicators for lower 
half lines we can use (4.1) and the bounds from [13, Lemma 2.2] to deduce 
Further specifying $q = \phi_{\mu_1, \sigma_1^2}$ we see that

\[ \sigma_0\sqrt{\Gamma(X) + \Psi} \le \frac{|\sigma_1^2 - \sigma_0^2|}{\sigma_0\sigma_1} + \frac{|\mu_1 - \mu_0|}{\sigma_0}, \]




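to be compared with [28, Proposition 3.6.1]. Lastly take $Z \sim \phi$ the standard
Gaussian and $X = F(Z)$ for $F$ some monotone increasing function on $\mathbb{R}$ such
that $f = F'$ is defined everywhere. Then straightforward computations yield

\[ I(X) = \mathrm{E}\left[\left(\frac{\psi_f(Z) + Z}{f(Z)}\right)^2\right] \]

with $\psi_f = (\log f)'$. In particular, if $F$ is a random function of the form $F(x) =
Yx$ for $Y > 0$ some random variable independent of $Z$, then simple conditioning
shows that the above becomes

\[ I(X) = \mathrm{E}\left[\frac{Z^2}{Y^2}\right] = \mathrm{E}\big[Z^2\big]\,\mathrm{E}\left[\frac{1}{Y^2}\right] = \mathrm{E}\left[\frac{1}{Y^2}\right] \]

so that

\[ d_{TV}(\phi, q_X) \le \frac{1}{\sqrt{2}}\sqrt{\mathrm{E}\left[\frac{1}{Y^2} - 1\right] + \mathrm{E}\big(Y^2 - 1\big)} \]

where $q_X$ refers to the density of $X = YZ$. This last inequality is to be compared
with [9, Lemma 4.1] and also [34].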
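
The expansion of the Fisher information distance to a Gaussian target with mean $\mu_0$ and variance $\sigma_0^2$, namely $I(X) + (\mu - \mu_0)^2/\sigma_0^4 + (\sigma^2/\sigma_0^2 - 2)/\sigma_0^2$, follows from $\mathrm{E}_q[(q'(X)/q(X))(X - \mu_0)] = -1$ and is easy to test numerically. The sketch below does so for a standard logistic density $q$, whose score is $-\tanh(x/2)$, against the target parameters $\mu_0 = 0.3$, $\sigma_0 = 1.5$; the choice of $q$, the parameter values and the helper names are assumptions made for this example.

```python
import math

def simpson(g, lo, hi, n):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * g(lo + i * h)
    return total * h / 3.0

def q(x):
    """Standard logistic density, written as 1 / (4 cosh^2(x / 2))."""
    return 0.25 / math.cosh(0.5 * x) ** 2

def score(x):
    """q'(x) / q(x) for the logistic density."""
    return -math.tanh(0.5 * x)

def expect(g):
    """E_q[g(X)] by quadrature on a window carrying essentially all the mass."""
    return simpson(lambda x: g(x) * q(x), -60.0, 60.0, 6000)

mu0, sigma0 = 0.3, 1.5                    # parameters of the Gaussian target
mu = expect(lambda x: x)
var = expect(lambda x: (x - mu) ** 2)
fisher = expect(lambda x: score(x) ** 2)  # Fisher information I(X)

direct = expect(lambda x: (score(x) + (x - mu0) / sigma0 ** 2) ** 2)
psi = (mu - mu0) ** 2 / sigma0 ** 4 + (var / sigma0 ** 2 - 1.0) / sigma0 ** 2
gamma = fisher - 1.0 / sigma0 ** 2        # Cramer-Rao functional
print(direct, gamma + psi)
```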



Appendix A: Bounds for the supremum norm 

First note that, for $l_y(x) = \delta_{\{x=y\}}$, the solution $f_{l_y}^p(x)$ of the Stein equation
(3.7) is of the form

\[ f_{l_y}^p(x) = \frac{1}{p(x)}\int_a^x \big(\delta_{\{z=y\}} - p(y)\big)p(z)\,dz = \frac{p(y)\big(\mathbb{I}_{[y,b)}(x) - P(x)\big)}{p(x)}. \]

For all densities $q$ such that $f_{l_y}^p \in \mathcal{F}(q)$, Theorem 3.1 applies and yields

\[ \sup_{y \in S} |p(y) - q(y)| \le \sup_{y \in S}\, p(y)\sqrt{\mathrm{E}_q\big[(\mathbb{I}_{[y,b)}(X) - P(X))^2/(p(X))^2\big]}\,\sqrt{\mathcal{J}(p, q)}, \]

where $b$ is either 0 or $+\infty$. We now prove that

\[ \sup_{y \in S}\, p(y)\sqrt{\mathrm{E}_q\big[(\mathbb{I}_{[y,b)}(X) - P(X))^2/(p(X))^2\big]} \le 1 \]

for $p(x) = c\,e^{-d|x|^\alpha}$ and any density $q$ satisfying the assumptions of the claim.
To this end note that straightforward manipulations lead to

\begin{align*}
\mathrm{E}_q\big[(\mathbb{I}_{[y,b)}(X) - P(X))^2/(p(X))^2\big]
&= \frac{1}{c^2}\int_a^y q(x)e^{2d|x|^\alpha}(P(x))^2\,dx + \frac{1}{c^2}\int_y^b q(x)e^{2d|x|^\alpha}(1 - P(x))^2\,dx \\
&\le \frac{1}{c^2}e^{2d|y|^\alpha}(P(y))^2\int_a^y q(x)\,dx + \frac{1}{c^2}e^{2d|y|^\alpha}(1 - P(y))^2\int_y^b q(x)\,dx \\
&= \frac{1}{c^2}e^{2d|y|^\alpha}(P(y))^2 + \frac{1}{c^2}e^{2d|y|^\alpha}(1 - 2P(y))\,\mathrm{P}_q(X \ge y),
\end{align*}

where the inequality is due to the fact that $e^{d|x|^\alpha}P(x)$ (resp., $e^{d|x|^\alpha}(1 - P(x))$)
is monotone increasing (resp., decreasing) on $(a, y)$ (resp., $(y, b)$); see the proof
of Theorem 4.1. This again directly leads to

\begin{align*}
\sup_{y \in S}\, p(y)\sqrt{\mathrm{E}_q\big[(\mathbb{I}_{[y,b)}(X) - P(X))^2/(p(X))^2\big]}
&\le \sup_{y \in (a,b)} c\,e^{-d|y|^\alpha}\,\frac{1}{c}\,e^{d|y|^\alpha}\sqrt{(P(y))^2 + (1 - 2P(y))\,\mathrm{P}_q(X \ge y)} \\
&= \sup_{y \in (a,b)} \sqrt{(P(y))^2 + (1 - 2P(y))\,\mathrm{P}_q(X \ge y)}.
\end{align*}

This last expression is equal to 1.